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ABSTRACT 

Legumes play a vital role in maintaining the nitrogen 
cycle of the biosphere. They conduct symbiotic 
nitrogen fixation through endosymbiotic relation- 
ships with bacteria in root nodules. However, this 
and other characteristics of legumes, including 
mycorrhization, compound leaf development 
and profuse secondary metabolism, are absent in 
the typical model plant Ambidopsis thaliana. 
We present LegumelP (http://plantgrn.noble.org/ 
LegumelP/), an integrative database for compara- 
tive genomics and transcriptomics of model 
legumes, for studying gene function and genome 
evolution in legumes. LegumelP compiles gene 
and gene family information, syntenic and phylogen- 
etic context and tissue-specific transcriptomic 
profiles. The database holds the genomic se- 
quences of three model legumes, Medicago 
truncatula, Glycine max and Lotus japonicus plus 
two reference plant species, A. thaliana and 
Populus trichocarpa, with annotations based on 
UniProt, InterProScan, Gene Ontology and the 
Kyoto Encyclopedia of Genes and Genomes data- 
bases. LegumelP also contains large-scale micro- 
array and RNA-Seq-based gene expression data. 
Our new database is capable of systematic 
synteny analysis across M. truncatula, G. max, L. 
japonicas and A. thaliana, as well as construction 
and phylogenetic analysis of gene families across 
the five hosted species. Finally, LegumelP provides 
comprehensive search and visualization tools that 
enable flexible queries based on gene annotation, 
gene family, synteny and relative gene expression. 

INTRODUCTION 

Legumes have the ability to conduct symbiotic nitrogen 
fixation through endosymbiotic interactions with bacteria 
residing in root nodules. Thus, these plants play a vital 



role in maintaining the nitrogen cycle of the biosphere. 
Some legume species are also important resources for 
oils, fiber, fuel, lumber, medicine, chemicals and horticul- 
tural varieties. 

In addition to root nodulation and nitrogen fixation 
symbiosis with rhizobia, legumes possess many unique 
features that are not found in the typical plant model 
Arabidopsis thaliana, including mycorrhization, compound 
leaf development, a protein-rich physiology, profuse sec- 
ondary metabolism, secondary compounds with valuable 
health-promoting properties, glandular trichome develop- 
ment and border cells within roots. Therefore, legumes are 
considered as another important plant model for studying 
physiology, genomics, plant-microbe interaction, sustain- 
able agriculture, food production, security and renewable 
bioenergy generation. 

Legumes have traditionally been divided into three 
main subfamilies: caesalpinioids, mimosoids and 
papilionoids (1), which all derived from a common 
ancestor around 60 million years ago (2). The papilionoids 
mainly include two sister clades, phaseoleae and trifolieae, 
which constitute >60% of all papilionoids. Of the 
two sister clades, the former mostly consists mostly 
of tropical, herbaceous species such as the soybean 
Glycine max. The latter consists primarily of temperate 
species such as Medicago sativa, a species that is closely 
related to two model legumes, M. truncatula and 
L. japonicus. 

Utilizing bacterial artificial chromosome (BAC)- 
by-BAC, whole-genome shotgun and second-generation 
sequencing approaches, the genomes of three legume 
species G. max (http://www.phytozome.net/soybean), 
L. japonicus (http://www.kazusa.or.jp/lotus) and M. 
truncatula (http://www.medicago.org/genome) have been 
sequenced recently (3-5). The results of these sequencing 
projects have become invaluable resources for legume 
research. Furthermore, large-scale gene expression 
profiling of L. japonicus (http://www.brics.dk/cgi- 
compbio/Niels/index.cgi), G. max (http://digbio.missouri 
.edu/soybean_atlas/) and M. truncatula (http://mtgea 
.noble.org/) have been also performed to characterize 
tens of thousands of genes in these species (6-8). 
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Comparative genomics and transcriptomics approaches 
have empowered gene discovery and gene functional char- 
acterization. For example, Libault et al. (9) systematically 
reviewed the identified transcription factors through 
comparative sequence analysis and expression data. 
Comparative genomics and transcriptomics, coupled 
with comprehensive gene annotation, gene family classifi- 
cation and phylogenetic analysis have also been used suc- 
cessfully to decipher biological processes that are unique 
to legumes, such as nodulation in response to rhizobial 
infection. For example, the MtHAP2.1, MtERN and 
LjNIN genes, which control nodule development, were 
identified by analysis of collinear relationships and gene 
expression profiles (10-12). 

The Legume Information System (LIS) is a community 
portal that hosts a vast quantity of legume-related data, 
including gene sequences, transcript sequences such as ex- 
pressed sequence tags (ESTs), genetic markers, literature 
and external links to multiple legume species (13). The LIS 
also provides useful tools such as orthologous gene and 
evolutionary event analysis, a chromosome visualization 
tool and synteny-view in a genome browser. However, due 
to the complexity of tandem duplication and genomic di- 
vergence, analysis of synteny or comparison of chromo- 
some position only may overlook a large number of 
orthologous genes. A typical example is the flavonoid 
gene family (14). Isoflavonoids, a subset of this family 

(15) , are unique to legume species, and most are 
involved in mediating host specificity within this plant. 
Nevertheless, only a few of these genes could be identified 
by collinear analysis since the tandem duplications that 
produced this family of flavonoids most likely occurred 
after the large-scale genome duplication events. 

Combining the analysis of gene families, phylogenetic 
context and tissue-specific transcriptomic profiles is a 
powerful method for studying complex biological events 

(16) . The online database PLAZA (17) integrates 
large-scale genomics data from several plant species for 
plant evolution research; however, a lack of gene expres- 
sion data make it less effective in inferring species-specific 
gene function and evolutionary history, especially in 
legume species. 

The recent publication of legume genomics and 
transcriptomics data has necessitated the development of 
a comprehensive genomics and transcriptomics database 
of model legumes. Here, we present LegumelP, an inte- 
grative database for comparative genomics and 
transcriptomics of model legumes, for use in studying 
gene function and genome evolution in this important 
plant family. LegumelP currently hosts large-scale 
genomics and transcriptomics data including the genome 
sequences of three model legumes (i.e. M. truncatula, G. 
max and L. japonicas) and two reference plant species (i.e. 
A. thaliana and Populus trichocarpd) with annotation based 
on Uniprot (18), InterProScan (19,20), Gene Ontology 
(GO) (21) and the Kyoto Encyclopedia of Genes and 
Genomes (KEGG) (22) and encompassing 222217 
protein-coding gene sequences. LegumelP also contains 
large-scale gene expression data compiled from 104 
L. japonicas microarrays, 156 microarrays from the M. 
truncatula gene atlas database and 14 RNA-Seq-based 



gene expression data from G. max gene atlas database, 
with profiles on time-course experiments and different 
tissues including four common tissues namely nodule, 
flower, root and leaf. In addition, LegumelP can 
perform systematic synteny analysis across M. truncatula, 
G. max, L. japonicas and A. thaliana, as well as construct 
the gene family and perform gene family-wide phylogen- 
etic analysis across the five hosted plant species. Finally, 
LegumelP can perform comprehensive search and visual 
representation to enable flexible queries based on gene 
annotation, gene family, synteny and relative gene 
expression. 

LegumelP is freely available at http://plantgrn. noble. 
org/LegumelP/ 

DATABASE PRODUCTION 

Data source and process 

The protein-coding and amino acid sequences for M. 
truncatula, L. japonicas and G. max were obtained from 
http://www.medicago.org/genome/download/, ftp://ftp. 
kazusa.or.jp/pub/lotus/lotus_r2. 5/ and ftp://ftp.jgi-psf. 
org/pub/JGI_data/phytozome/v7.0/Gmax/annotation/, 
respectively. Data for the two outgroup reference species, 
A. thaliana and P. trichocarpa, were acquired from ftp:// 
ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_ 
release/ and ftp://ftp.jgi-psf.org/pub/JGI_data/ 
phytozome/v7.0/Ptrichocarpa/annotation/, respectively. 
In total, LegumelP integrates 222 217 protein-coding 
gene sequences and 221 706 amino acid sequences. 

Large-scale microarray and RNA-Seq-based gene ex- 
pression data for M. truncatula, L. japonicas and G. max 
were obtained from http://mtgea.noble.org/, http://cgi- 
www.cs.au.dk/cgi-compbio/Niels/index.cgi and http:// 
digbio.missouri.edu/soybean_atlas/, respectively (6-8,23). 
Gene expression data for L. japonicas are available only 
for version 1.0 genome sequences in the atlas database. 
Therefore, we remapped the Affymetrix L. japonicas 
GeneChip probesets using BLAST to align CDS se- 
quences (Version 2.5) against the GeneChip probe set 
target sequences downloaded from the ArrayExpress 
(http://www.ebi.ac.uk/arrayexpress/) database (E- value 
< le-10). Only the highest-scoring hit was selected. As 
the result, LegumelP integrates data from 104 L. japonicas 
microarrays, 156 M. truncatula microarrays and 14 
RNA-Seq-based gene expression profiles for G. max, 
with profiles on different tissues including four common 
tissues (nodule, flower, root and leaf) for all three model 
legume species. LegumelP also integrates data from add- 
itional tissues and time-course experiments for individual 
species. 

Comprehensive gene annotation 

Genome sequences were annotated using a series of 
manually curated standard databases. The protein se- 
quences were first searched against UniProt using 
BLASTP with a cutoff £- value <le-04. The top five 
most meaningful query results were considered valid. 
This method was also used to annotate protein sequences 
against GO, KEGG, Transporter Classification Database 
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(TCDB) (24) and Plant Transcription Factor Database 
(PlantTFDB) (25). Conserved domains between protein 
sequences were identified with InterProScan software 
using its default Is-value cutoff thresholds (20). 

Systematic synteny identification 

Alignment of syntenic regions between non-legume and 
legume species is an effective approach for identifying 
patterns of evolutionary conservation and divergence 
across genomes. LegumelP employs the DAGchainer 
program (26) to identify syntenic regions between M. 
truncatula, L. japonicas, G. max and A. thaliana. First, 
we performed an all-by-all BLASTN search with an E- 
value cutoff <le-10 to identify intraspecies paralogous 
pairs and interspecies homologous pairs. Then, we 
applied DAGchainer (parameters Z = 12, D = 10, g = 1 
and a = 5) to discover orthologous pairs of interspecies 
collinear regions. Parameters with -s and -i in DAGc 
hainer were applied to identify collinear paralogous 
pairs in the same species. To all the identified homologous 
pairs within syntenic regions, we applied the F3x4 model 
from the PAML4.0 software package (27) to estimate the 
ratio of the number of non-synonymous substitutions/ 
non-synonymous site (Ka) to the number of synonymous 
substitutions/synonymous site (Ks) (i.e. Kei/Ks). 

Cross-species gene family and phylogenetic analysis 

Due to the frequent occurrence of tandem duplication or 
sequence diversity, the order of orthologous genes within a 
collinear/syntenic region may have become disrupted 
during evolutionary development (28). To better study 
gene function and genome evolution, two groups of 
putative gene families were constructed in the five 
species based on protein-coding sequence similarity. This 
was accomplished by the TribeMCL (29) and OrthoMCL 
(30) clustering algorithms that are complementary to each 
other. Although TribeMCL outputted fewer gene families, 
each resultant family consisted of more member genes but 
with a higher false-positive rate that was likely due to the 
inherent nature of BLAST hits within the TribeMCL al- 
gorithm. In contrast, OrthoMCL yielded a large number 
of smaller gene families with a lower false-positive rate. To 
construct the TribeMCL gene families, we first performed 
an all-by-all BLASTP search of the protein-coding nucleo- 
tide sequences from the five plant species using an _E-value 
<le-10 as the cutoff threshold. TribeMCL (with default 
option / = 2.0) was then employed to delineate large gene 
families based on the BLAST results. We applied the 
OrthoMCL method to construct gene families with 
fewer member genes and a lower false-positive rate 
based on the same blast result. Application of both algo- 
rithms resulted in the grouping of 95.70% of 212 653 
protein-coding genes into 12 166 TribeMCL gene 
families and 70.40% of all protein-coding genes grouped 
into 19 315 OrthoMCL gene families. 

To construct phylogenetic trees for the gene families, 
multiple sequence alignment was first performed using 
the MUSCLE software (31). Unrooted trees were then 
created using PHYL1P software (including the seqboot, 



proml and consensus programs) with 100 bootstrap 
replications. 

Analysis of large-scale gene expression data 

To enable comparative transcriptomics analysis across the 
three model legume species, LegumelP integrates 
microarray-based and RNA-Seq-based gene expression 
profiling data from four common tissues, namely 
nodule, root, flower and leaf, in its core expression table. 
Furthermore, large-scale gene expression profiles for add- 
itional tissues and time-course experiments were also 
included for individual species. The hosted gene expres- 
sion profiling experiments are described in the 
Supplementary Materials and Methods. We used a 
rank-based method to normalize the gene expression 
data. Briefly, the genes identified by each microarray or 
RNA-Seq data set were first ranked from lowest to highest 
in terms of expression. The ranks were subsequently 
divided by the total number of expressed genes. 
Furthermore, the expression values from individual 
tissues were calculated by averaging the ranks of different 
experimental conditions applied to the same tissue. 

Clustering of gene families from the three legume 
species was estimated based on the Pearson's correlation 
coefficient of normalized expression in the four tissues 
using the hierarchical clustering algorithm (32). 

Database development 

LegumelP was developed using the Java and Groovy lan- 
guages. The system runs on a Linux-based RESIN J2EE 
web server architecture using MySQL as its database man- 
agement system. Circos software (33) and Gbrowse_syn, 
which is a Gbrowse-based synteny browser (34), were 
adopted for visualization of macro- and micro-syntenic 
relationships, respectively. The interactive phylogenetic 
tree is rendered by Archaeopteryx (35) and gene expres- 
sion profiles are plotted using the OpenFlashChart 
package (http://teethgrinder.co.uk/open-flash-chart/). 
Gene cluster is depicted as heatmap in the format of a 
background-colored HTML table. 

User-friendly web interface for data access 

LegumelP provides a comprehensive web interface for 
searching and exploring genes, gene families, syntenic 
regions and gene expression patterns. For example, 
through a simplified Keyword Gene Search interface, 
users can search LegumelP by gene name, description, 
family and specific biological classification system, such 
as a GO term, InterProScan domain name, transporter 
family, transcription factor family, KEGG pathway or 
compound name. In the Advanced Gene Search page, 
more complex searching criteria can be used, including 
combinations of keywords for querying quantitative 
tissue-specific gene expression patterns. The search 
results are listed in a table with links for batch download- 
ing, a detailed page of comprehensive gene annotations, 
plots depicting the transcriptomics profile if applicable 
and sequence information. Users can also navigate to cor- 
responding synteny and gene family pages using the 
'TribeGroup', 'OrthoGroup' and 'Included gene in 



D1224 Nucleic Acids Research, 2012, Vol. 40, Database issue 





Synteny in chr05, Medicago truncatula 



N < ► M 



| Sort by Mapped species | T | 

Chromosome(or 



B 



No. 1 page / Total 13 pages, 



Species. 



Mapped 
romosome 
Coiitig) 



- ^ Range ^Genes^ S*^es Chromosome(or Mapped Range Genes(Mapped Genes 



truncatula 
Medicago 
truncatula 
Medicago 
truncatula 



chr05 
chr05 
chr05 



11913616.11996024 Link 



Arabidopsis 
thaliana 



Chrl 



13S17M1..1*M8900 Link A ™ bi f°P sis C hll 
thaliana 



18884769..19015761 Link 



Arabidopsis 
thaliana 



1 1575434.. 1 1609591 



85495 18..8569927 



9155134..9195102 



Range) 

Link 
T inlf 
Link 



only 

4.222 Link 

5.6836 Link 
3.1946 Link 



B At. Gm 



Arabid 
Chrl 


psis thaliana 

1T1G32170.1 
WW — 


1T1G32180.1 




(T1G32190. 
cnr-m — 4=1 
VUG32190. 
nr ai — i 


L 
! 


W1G32200.: 


HT1G32210.1 

«+~^l~U 

AT1G32 

J . . . 


AT1G3 
>2t>.l 




=|T1G32200.; 






11580k 


11590k 




H 1 , 1 1 1 V 

11600k 










H 1- 

610k 



Medibago truncatula (reference) 



\ 



\ 



a* 
chr05 



^tntt — it ti Hi h 
< ■ tit — ii 
- 1 ■ tit — it ti 



Gm09 

I I I' " h" I I 

'590k 40580k 40570k 40560k 40550k 40540k 4*530 



Glycine max (reverse) 



Glyl»a09g34130.1 



Gly naOS 
Gly laOS 
Gly raO! 



3»ol™ 



4^i"t»t 
i ti tit 



MOOk 
tit i $ 



Figure 1. Web interfaces for (A) searching, (B) exploring and (C) visualizing macrosyntenys and microsyntenys. 



synteny' links. The phylogenetic tree and gene cluster 
heatmap are provided in detail on the gene family page. 

LegumelP provides a simplified page to allow users to 
explore syntenic regions by chromosome or Contig ID. 
The macrosynteny outputs, including interspecies and 



intraspecies syntenic regions are represented as Circos 
maps (Figure 1A) and a summary table (Figure IB) with 
links to detail pages, such as visualization of microsynteny 
in the GBrowse_syn module (Figure 1C) and a list of 
genes within the syntenic region. 
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Figure 2. (A) Expression cluster displayed as heatmap for all expressed member genes in TribeMCL00867 and (B) OrthoMCL.07722, which include 
SymRK genes. 



LegumelP also integrates BLAST search interfaces, 
allowing users to search homologous genes or protein se- 
quences based on sequence similarity. 



DEMONSTRATION OF THE UTILITY OF LEGUMEIP 

Below we demonstrate the utility of LegumelP in discover- 
ing and characterizing gene candidates in legumes. 
Additional examples are available in the Help page of 
the LegumelP web server. 

Mining UGT and p450 genes in M. truncatula with gene 
search tools 

The UDP-glycosyltransferase (UGT) gene family (36), 
which is reportedly involved in the biogenesis of important 
secondary metabolites such as fiavonoids, is highly 
enriched in the legume M. truncatula compared to A. 
thaliana. Therefore, much research on plant secondary 
metabolism has been focused on this family. Since these 
enzymes feature a conserved InterProScan domain, 
IPR002213, we used this domain ID as a keyword to 
search for UGT genes in the M. truncatula genome. This 
search yielded 351 genes as potential UGT candidates for 
further study. 

CYP84 constitutes a family of cytochrome 
P450-dependent mono-oxygenases defined by ferulate 
5-hydroxylase activity, which is reportedly mediates a 
plant defense mechanism (37). Thus, we searched for 
CYP84 genes in the M. truncatula genome using the 
keywords 'ferulate' and '5-hydroxylase' and found 10 can- 
didate genes. Microarray data showing detectable expres- 
sion levels (i.e. higher than the 10th percentile in at least 
one of the four common tissues analyzed) were available. 



These genes likely function as P450-dependent mono- 
oxygenases. However, experimental validation is still 
required as many of the P450 subfamilies are quite 
similar in terms of both sequence and structure. 

In both cases, users can batch download all of the can- 
didate sequences by clicking on links located on the upper 
right-hand corner of the result pages. 

Mining SymRK genes for symbiosis analysis in legumes 

Symbiosis with rhizobia in the nodule is the source of 
nitrogen fixation in legumes. The leucine-rich repeat 
receptor kinases [also known as symbiosis receptor-like 
kinase (SymRK)] are reportedly involved in the signaling 
pathway that mediates early root response to bacterial and 
fungi infection in epidermal tissues of root nodules. These 
kinases also mediate the uptake of symbiotic bacteria 
and fungi into plant cells (38,39). In addition, SymRKs 
play an essential role in nodulation initiation. 
For example, SymRK in Lotus (previous gene name, 
CM0177.3340.r2.m), together with other Nod-factor 
genes, was reportedly involved in signaling that leads to 
epidermal calcium spikes (40). 

Using the keyword 'SymRK', we retrieved seven 
SymRK genes using LegumelP. The results further 
indicated that all of these genes were assigned into one 
family, the TribeMCL00867 group, or according to the 
OrthoMCL method, the OrthoMCL07722 group. 
Corresponding probesets were available on the 
Medicago GeneChip for four of the seven genes, and 
each of these was highly expressed in nodule and root 
(Figure 2), suggesting their possible roles in nodule initi- 
ation. The reconstructed phylogenetic tree demonstrated 
that these genes most likely also belong to the same clade 
(Figure 3) and gene family, OrthoMCL07722. Synteny 
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Figure 3. Phylogenetic tree of the TribeMCL00867 gene family. 



analysis further indicated their derivation from common 
ancestral genes (Figures 4 and 5). Analysis of the phylo- 
genetic tree enabled us to identify two P. trichocarpa genes 
located in the same clade (Figure 3), suggesting that the 
function of SymRK genes may not be specific to only 
legume species and may be related to ectomycorrhiza sym- 
biosis in P. trichocarpa. Altogether the results produced 
from LegumelP provide additional support for the 
SymRK gene family as the common genetic basis of 
root nodule symbiosis (38,39). 

CONCLUSIONS AND FUTURE PERSPECTIVES 

The rapid growth of legume-related genomics and 
transcriptomics data demands development of integrated 
databases and advanced bioinformatics analysis tools for 



not only efficiently managing, storing, retrieving and 
sharing data, but also for effectively integrating, analyzing 
and mining large volumes of highly complex information. 

LegumelP compiles data and related analytical tools 
such as comprehensive Uniprot-/InterProScan-/GO-/ 
KEGG-based gene annotations, relative gene expression 
data, gene family classification and macrosynteny 
analysis. The data and analytical tools are thoughtfully 
organized to enable quick searches for genes of interest 
through user-friendly web interfaces. Transcriptomics 
profiling, synteny and phylogenetic analysis are powerful 
tools that can be used to discover gene function, including 
identifying genes associated with the symbiotic nitrogen 
fixation process in legumes. Moreover, synteny and phylo- 
genetic analysis are helpful in better understanding the 
evolution of terrestrial plants such as legumes. All of 
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Figure 4. SymRK orthologous genes within a syntonic region common to L. japonicus, G. max and M. truncatula. 
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Figure 5. SymRK paralogous genes within a syntenic region in G. max. 



these features demonstrate the enormous potential of 
LegumelP as a vital tool in studying fundamental bio- 
logical questions. 

We are committed to continually improving LegumelP. 
Additional microarray- and RNA-Seq-based gene expres- 
sion data will be populated into the LegumelP database as 
it is made available in public repositories. In addition, we 
will integrate large-scale genomic and EST sequences from 
different sources, such as the Medicago Hapmap project 
(http : / /www.medicagohapmap . org/) . 



SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Material and Methods. 
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